1 Overview

1.1 RStudio Workspace

1.2 Contents of a Rmarkdown {.Rmd}

1.3 Knitting a document

  • Knitting
  • Preview in Viewer Pane

2 Setup

2.1 Loading Packages

  1. ONLY ONCE – Install Package
  • Type: install.packages("put_package_name_here") into the console
  • e.g. install.packages("tidyverse")
  1. EVERY TIME YOU WANT TO USE – Load Package
  • Type: library("put_package_name_here") into your code chunk
  • e.g. library("tidyverse")
# Load Relevant Packages
library(tidyverse) # piping `%>%`, plotting, reading data
library(skimr) # exploratory data summary
library(naniar) # exploratory plots
library(kableExtra) # tables
library(lubridate) # for date variables
library(plotly)

# Extension
  # Try to save time by installing a package to install packages!
  # install.packages("pacman")
  # pacman::p_load(tidyverse, skimr, naniar, kableExtra, lubridate, plotly)

2.2 Loading Data

If the data is online:

This is easy for datasets that are not too large & already hosted online!

If the data is a local file:

  1. File Management
  • In general, put your data file in the same folder as your R file.
  1. Setting working directory
  • If you are knitting your document: it will automatically look for data in the same folder as your R file. So you should have no dramas (per step 1.).

  • If you are running a section of code: you will need to specify which folder the data is in. The best way to do this is by following these menu options: Session Menu >> Set Working Directory >> To Source File Location.

  1. Load Data
  • Read data using the read_csv() function which comes from the tidyverse package.
# Load Data
data = read_csv("https://www.maths.usyd.edu.au/u/UG/OL/OLEO5605/r/NYC_Drinking_Water.csv")
#data = read_csv("Drinking_Water_Quality_Distribution_Monitoring_Data.csv")

3 Exploratory Data Analysis

3.1 Quick Snapshot

# Glimpse Function [From tidyverse package]
data %>% glimpse()
## Rows: 72,709
## Columns: 10
## $ `Sample Date`                         <chr> "01/01/2015", "01/01/2015", "01…
## $ `Sample Time`                         <time> 12:19:00, 11:15:00, 10:09:00, …
## $ `Sample Site`                         <chr> "1S07", "1S04", "1S03A", "1S03B…
## $ `Sample class`                        <chr> "Operational", "Operational", "…
## $ Location                              <chr> "SS - Shaft 7 of City Tunnel No…
## $ `Residual Free Chlorine (mg/L)`       <dbl> 0.58, 0.71, 0.79, 0.77, 0.74, 0…
## $ `Turbidity (NTU)`                     <dbl> 0.96, 0.94, 0.93, 0.93, 0.95, 1…
## $ `Fluoride (mg/L)`                     <dbl> 0.79, 0.80, 0.79, 0.80, NA, NA,…
## $ `Coliform (Quanti-Tray) (MPN /100mL)` <chr> "<1", "<1", "<1", "<1", "<1", "…
## $ `E.coli(Quanti-Tray) (MPN/100mL)`     <chr> "<1", "<1", "<1", "<1", "<1", "…
# Skim Function [From skimr package]
data %>% skim()
Data summary
Name Piped data
Number of rows 72709
Number of columns 10
_______________________
Column type frequency:
character 6
difftime 1
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Sample Date 0 1 10 10 0 1673 0
Sample Site 0 1 4 5 0 398 0
Sample class 0 1 9 20 0 7 0
Location 0 1 29 152 0 1263 0
Coliform (Quanti-Tray) (MPN /100mL) 0 1 1 6 0 47 0
E.coli(Quanti-Tray) (MPN/100mL) 0 1 1 2 0 3 0

Variable type: difftime

skim_variable n_missing complete_rate min max median n_unique
Sample Time 2 1 17520 secs 57780 secs 10:10:00 463

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Residual Free Chlorine (mg/L) 3 1.00 0.58 0.21 -9.99 0.45 0.59 0.72 2.20 ▁▁▁▁▇
Turbidity (NTU) 506 0.99 0.74 0.28 0.07 0.62 0.73 0.86 33.80 ▇▁▁▁▁
Fluoride (mg/L) 63336 0.13 0.71 0.06 0.03 0.69 0.71 0.73 0.89 ▁▁▁▇▇
# Summary Function [From base package -- preinstalled!]
data %>% summary()
##  Sample Date        Sample Time       Sample Site        Sample class      
##  Length:72709       Length:72709      Length:72709       Length:72709      
##  Class :character   Class1:hms        Class :character   Class :character  
##  Mode  :character   Class2:difftime   Mode  :character   Mode  :character  
##                     Mode  :numeric                                         
##                                                                            
##                                                                            
##                                                                            
##    Location         Residual Free Chlorine (mg/L) Turbidity (NTU)  
##  Length:72709       Min.   :-9.9900               Min.   : 0.0700  
##  Class :character   1st Qu.: 0.4500               1st Qu.: 0.6200  
##  Mode  :character   Median : 0.5900               Median : 0.7300  
##                     Mean   : 0.5845               Mean   : 0.7385  
##                     3rd Qu.: 0.7200               3rd Qu.: 0.8600  
##                     Max.   : 2.2000               Max.   :33.8000  
##                     NA's   :3                     NA's   :506      
##  Fluoride (mg/L) Coliform (Quanti-Tray) (MPN /100mL)
##  Min.   :0.03    Length:72709                       
##  1st Qu.:0.69    Class :character                   
##  Median :0.71    Mode  :character                   
##  Mean   :0.71                                       
##  3rd Qu.:0.73                                       
##  Max.   :0.89                                       
##  NA's   :63336                                      
##  E.coli(Quanti-Tray) (MPN/100mL)
##  Length:72709                   
##  Class :character               
##  Mode  :character               
##                                 
##                                 
##                                 
## 

3.2 Exploring missingness

# vis_miss function [From visdat or naniar packages]
vis_miss(data, warn_large_data = FALSE)

3.3 Exploring numeric variables

  • What do you notice about outliers & skewness?
data %>%
  select(where(is.numeric)) %>%
  pivot_longer(everything(),
               names_to = "Variable",
               values_to = "Value") %>%
  group_by(Variable) %>%
  summarise(Mean = mean(Value, na.rm = TRUE),
            Median = median(Value, na.rm = TRUE)) %>%
  kable() %>% # putting into a table
  kable_styling(bootstrap_options = c("hover")) # making table look good
## `summarise()` ungrouping output (override with `.groups` argument)
Variable Mean Median
Fluoride (mg/L) 0.7144069 0.71
Residual Free Chlorine (mg/L) 0.5844564 0.59
Turbidity (NTU) 0.7385309 0.73
p = data %>%  
  select(where(is.numeric)) %>%
  pivot_longer(everything(), 
               names_to = "Variable", 
               values_to = "Value") %>% 
  filter(!is.na(Value)) %>%
  ggplot(aes(x = Value)) +
  facet_wrap(~ Variable, scales = "free_y") +
  geom_histogram(bins = 30, fill = "lightblue", color = "darkblue") +
  labs(y = "Frequency") +
  theme_bw()

ggplotly(p)
p = data %>%  
  select(where(is.numeric)) %>%
  pivot_longer(everything(), 
               names_to = "Variable", 
               values_to = "Value") %>% 
  filter(!is.na(Value)) %>%
  ggplot(aes(y = Value)) +
  facet_wrap(~ Variable, scales = "free_y") +
  geom_boxplot(fill = "lightblue", color = "darkblue") +
  theme_bw() +
  theme(strip.text.x = element_text(size = 8))

p

4 Data Cleaning

# See what state things are currently in
data %>% glimpse()
## Rows: 72,709
## Columns: 10
## $ `Sample Date`                         <chr> "01/01/2015", "01/01/2015", "01…
## $ `Sample Time`                         <time> 12:19:00, 11:15:00, 10:09:00, …
## $ `Sample Site`                         <chr> "1S07", "1S04", "1S03A", "1S03B…
## $ `Sample class`                        <chr> "Operational", "Operational", "…
## $ Location                              <chr> "SS - Shaft 7 of City Tunnel No…
## $ `Residual Free Chlorine (mg/L)`       <dbl> 0.58, 0.71, 0.79, 0.77, 0.74, 0…
## $ `Turbidity (NTU)`                     <dbl> 0.96, 0.94, 0.93, 0.93, 0.95, 1…
## $ `Fluoride (mg/L)`                     <dbl> 0.79, 0.80, 0.79, 0.80, NA, NA,…
## $ `Coliform (Quanti-Tray) (MPN /100mL)` <chr> "<1", "<1", "<1", "<1", "<1", "…
## $ `E.coli(Quanti-Tray) (MPN/100mL)`     <chr> "<1", "<1", "<1", "<1", "<1", "…
# Data Cleaning

## Changing type of `Sample Date`
  # Note: mdy from [From tidyverse package]
data = data %>% 
  mutate(`Sample Date` = mdy(`Sample Date`))

# Adding additional time columns
data = data %>% 
        mutate(`Week of Year` = week(`Sample Date`),  
               `Weekday` = wday(`Sample Date`), 
               `Month Number` = month(`Sample Date`),
               `Hour` = hour(`Sample Time`))

# Giving Month Name an Order
data = data %>% 
  mutate(`Month` = factor(month.name[`Month Number`], 
                          levels = month.name))

# Converting Categorical Variables to Factors
data = data %>% 
  mutate(across(where(is_character) & !c(Location, `Sample Site`), 
         as_factor))

# Drop NA -- Q: Do you think this is appropriate?
# data = data %>%
#   drop_na()
# Check we're happy with cleaned data
data %>% glimpse()
## Rows: 72,709
## Columns: 15
## $ `Sample Date`                         <date> 2015-01-01, 2015-01-01, 2015-0…
## $ `Sample Time`                         <time> 12:19:00, 11:15:00, 10:09:00, …
## $ `Sample Site`                         <chr> "1S07", "1S04", "1S03A", "1S03B…
## $ `Sample class`                        <fct> Operational, Operational, Opera…
## $ Location                              <chr> "SS - Shaft 7 of City Tunnel No…
## $ `Residual Free Chlorine (mg/L)`       <dbl> 0.58, 0.71, 0.79, 0.77, 0.74, 0…
## $ `Turbidity (NTU)`                     <dbl> 0.96, 0.94, 0.93, 0.93, 0.95, 1…
## $ `Fluoride (mg/L)`                     <dbl> 0.79, 0.80, 0.79, 0.80, NA, NA,…
## $ `Coliform (Quanti-Tray) (MPN /100mL)` <fct> <1, <1, <1, <1, <1, <1, <1, <1,…
## $ `E.coli(Quanti-Tray) (MPN/100mL)`     <fct> <1, <1, <1, <1, <1, <1, <1, <1,…
## $ `Week of Year`                        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Weekday                               <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
## $ `Month Number`                        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Hour                                  <int> 12, 11, 10, 10, 9, 8, 9, 11, 7,…
## $ Month                                 <fct> January, January, January, Janu…
data %>% summary()
##   Sample Date         Sample Time       Sample Site       
##  Min.   :2015-01-01   Length:72709      Length:72709      
##  1st Qu.:2016-04-03   Class1:hms        Class :character  
##  Median :2017-06-26   Class2:difftime   Mode  :character  
##  Mean   :2017-06-14   Mode  :numeric                      
##  3rd Qu.:2018-09-11                                       
##  Max.   :2019-10-31                                       
##                                                           
##                Sample class     Location         Residual Free Chlorine (mg/L)
##  Operational         :27733   Length:72709       Min.   :-9.9900              
##  Compliance          :44289   Class :character   1st Qu.: 0.4500              
##  Resample_Compliance :  472   Mode  :character   Median : 0.5900              
##  Resample_Operational:    1                      Mean   : 0.5845              
##  Entry Point         :  168                      3rd Qu.: 0.7200              
##  Re-Sample           :   33                      Max.   : 2.2000              
##  Op-resample         :   13                      NA's   :3                    
##  Turbidity (NTU)   Fluoride (mg/L) Coliform (Quanti-Tray) (MPN /100mL)
##  Min.   : 0.0700   Min.   :0.03    <1     :72436                      
##  1st Qu.: 0.6200   1st Qu.:0.69    1      :  102                      
##  Median : 0.7300   Median :0.71    >200.5 :   33                      
##  Mean   : 0.7385   Mean   :0.71    2      :   28                      
##  3rd Qu.: 0.8600   3rd Qu.:0.73    3.1    :   20                      
##  Max.   :33.8000   Max.   :0.89    4.2    :    9                      
##  NA's   :506       NA's   :63336   (Other):   81                      
##  E.coli(Quanti-Tray) (MPN/100mL)  Week of Year      Weekday     
##  <1:72704                        Min.   : 1.00   Min.   :1.000  
##  - :    1                        1st Qu.:14.00   1st Qu.:2.000  
##  1 :    4                        Median :26.00   Median :4.000  
##                                  Mean   :26.15   Mean   :4.019  
##                                  3rd Qu.:38.00   3rd Qu.:6.000  
##                                  Max.   :53.00   Max.   :7.000  
##                                                                 
##   Month Number         Hour             Month      
##  Min.   : 1.000   Min.   : 4.00   August   : 6972  
##  1st Qu.: 4.000   1st Qu.: 9.00   July     : 6851  
##  Median : 6.000   Median :10.00   May      : 6796  
##  Mean   : 6.422   Mean   : 9.68   June     : 6630  
##  3rd Qu.: 9.000   3rd Qu.:11.00   September: 6578  
##  Max.   :12.000   Max.   :16.00   January  : 6521  
##                   NA's   :2       (Other)  :32361

5 Interrogating the data

5.1 Which date had the highest Turbidity reading?

top_10_turbidity = data %>% 
  arrange(desc(`Turbidity (NTU)`)) %>% 
  select(`Sample Date`, `Turbidity (NTU)`) %>%
  head(10)

top_10_turbidity %>%
  kable(caption = "Top 10 Turbidity readings") %>%
  kable_styling(bootstrap_options = c("hover"))
Top 10 Turbidity readings
Sample Date Turbidity (NTU)
2018-01-16 33.80
2017-05-19 27.10
2019-05-17 14.10
2017-12-15 6.97
2018-01-23 6.97
2017-11-27 6.68
2017-12-13 6.29
2016-05-16 6.04
2017-12-29 5.96
2018-02-21 5.89

The highest Turbidity rating was on 2018-01-16 with a reading of 33.8.

5.2 Is there a difference between the median readings for Turbidity, Chlorine, and Fluoride for the different types of sample sites?

class_medians = data %>% 
  group_by(`Sample class`) %>%
  summarise(med_chlorine = median(`Residual Free Chlorine (mg/L)`, na.rm = TRUE),
            med_turbidity = median(`Turbidity (NTU)`, na.rm = TRUE),
            med_flouride = median(`Fluoride (mg/L)`, na.rm = TRUE)) 
## `summarise()` ungrouping output (override with `.groups` argument)
class_medians %>%
  kable() %>%
  kable_styling(bootstrap_options = c("hover"))
Sample class med_chlorine med_turbidity med_flouride
Operational 0.69 0.75 0.71
Compliance 0.52 0.72 0.71
Resample_Compliance 0.47 0.70 NA
Resample_Operational 0.75 0.98 NA
Entry Point 0.63 0.72 0.72
Re-Sample 0.39 0.67 NA
Op-resample 0.73 0.70 0.72

5.3 Create a boxplot to visualise the difference between Entry Point and Operational levels of Residual Free Chlorine.

p = data %>% 
  filter(`Sample class` == "Entry Point" | 
         `Sample class` == "Operational") %>%
  ggplot(aes(x = `Sample class`, 
             y = `Residual Free Chlorine (mg/L)`)) +
  geom_boxplot(fill = "lightblue", color = "darkblue") +
  labs(title = "Residual Free Chlorine (mg/L) for different sample classes") +
  theme_bw()

ggplotly(p)

5.4 Which sample sites have the highest and lowest median readings for each chemical?

site_medians_wide = data %>% 
  group_by(`Sample Site`) %>%
  summarise(med_chlorine = median(`Residual Free Chlorine (mg/L)`, na.rm = TRUE),
            med_turbidity = median(`Turbidity (NTU)`, na.rm = TRUE),
            med_fluoride = median(`Fluoride (mg/L)`, na.rm = TRUE)) 
## `summarise()` ungrouping output (override with `.groups` argument)
# Tidy Way
site_medians_long = site_medians_wide %>%
  pivot_longer(!c(`Sample Site`),
               names_to = "Median Type",
               values_to = "Median Value")

max_min_median_sites = site_medians_long %>%
  group_by(`Median Type`) %>%
    summarise(
    Min_Val = min(`Median Value`, na.rm = TRUE),
    Min_Site = paste(`Sample Site`[which(`Median Value` == Min_Val)], 
                     collapse = ", "),
    Max_Val = max(`Median Value`, na.rm = TRUE),
    Max_Site = paste(`Sample Site`[which(`Median Value` == Max_Val)], 
                     collapse = ", ")
  )
## `summarise()` ungrouping output (override with `.groups` argument)
max_min_median_sites %>%
  kable() %>%
  kable_styling(bootstrap_options = c("hover"))
Median Type Min_Val Min_Site Max_Val Max_Site
med_chlorine 0.11 77750 0.91 1S03A
med_fluoride 0.69 32350 0.81 1S04
med_turbidity 0.45 3SC26 1.03 51900
# Non-Tidy Way -- copy code for each Median Type
# site_medians_wide %>%
#   select(`Sample Site`, med_turbidity) %>%
#   arrange(desc(med_turbidity)) %>%
#   filter(row_number() %in% c(1, n()))
# 
# site_medians_wide %>%
#   select(`Sample Site`, med_fluoride) %>%
#   arrange(desc(med_fluoride)) %>%
#   filter(row_number() %in% c(1, n()))
# 
# site_medians_wide %>%
#   select(`Sample Site`, med_chlorine) %>%
#   arrange(desc(med_chlorine)) %>%
#   filter(row_number() %in% c(1, n()))

5.5 Visualise the difference in readings between the top and bottom sites for Turbidity in different ways. Can you find anything interesting about the sites?

site_names = max_min_median_sites %>%
  filter(`Median Type` == "med_turbidity") %>%
  select(`Min_Site`, `Max_Site`) %>%
  t()

p = data %>% 
  filter(`Sample Site` %in% site_names) %>% 
  ggplot(aes(x = `Turbidity (NTU)`, fill = `Sample Site`)) +
    geom_histogram(bins = 30, color = "white") +
    scale_fill_manual(values = c("lightblue", "darkblue")) +
    theme_bw() +
    labs(y = "Frequency")

ggplotly(p)
p = data %>% 
  filter(`Sample Site` %in% site_names) %>% 
  ggplot(aes(x = `Sample Date`, y = `Turbidity (NTU)`, color = `Sample Site`))+
    geom_line() +
    scale_color_manual(values = c("lightblue", "darkblue")) +
    theme_bw()

ggplotly(p)

5.6 How have the median readings for each of the chemicals changed over time?

p = data %>% 
  
  group_by(`Sample Date`) %>%
  
  summarise(med_chlorine = median(`Residual Free Chlorine (mg/L)`, na.rm = TRUE),
            med_turbidity = median(`Turbidity (NTU)`, na.rm = TRUE),
            med_fluoride = median(`Fluoride (mg/L)`, na.rm = TRUE)) %>%
  
  pivot_longer(!c(`Sample Date`),
               names_to = "Median Type",
               values_to = "Median Value") %>%
  
  ggplot(aes(x = `Sample Date`, y = `Median Value`, colour = `Median Type`)) + 
    geom_line() +
    scale_color_manual(values = c("#ab5f54", "lightblue", "darkblue")) +
    theme_bw()
## `summarise()` ungrouping output (override with `.groups` argument)
ggplotly(p)

6 Your Own Research Questions

What questions do you have about the data? Insert your analysis here.

6.1 Executive Summary

This document sought to investigate the NYC Water Quality Monitoring dataset. Some initial issues were highlighted, including a lack of consistency in detail between measures presented in the report and the data itself. Exploratory data analysis revealed interesting patterns in the data and a variety of issues for further analysis, such as missing data, skew, and outliers. Preliminary investigation of the distribution of residual free chlorine by month over the whole dataset revealed that most months are somewhat roughly Gaussian, however, the distributions do appear leptykurtic (very high probability density in the middle) - as evidenced by the skinny distributions.

6.2 Research Question: Is residual free chlorine normally distributed across months for the entire dataset?

data %>%
  mutate(year = year(`Sample Date`)) %>%
  ggplot(aes(x = `Residual Free Chlorine (mg/L)`)) +
  geom_histogram(alpha = 0.9, fill = "steelblue2", binwidth = 0.01) +
  labs(title = "Distribution of residual free chlorine by month for 2019",
       x = "Residual Free Chlorine (mg/L)",
       y = "Frequency") +
  facet_wrap(~`Month Number`, scales = "free")